route EthosU input/output memcpy through overridable hook (#19264) #19264

Merged
meta-codesync[bot] merged 1 commit into main from export-D103455766
May 6, 2026

@3l1 3l1 commented May 1, 2026

Summary:

The EthosU backend's input/output scratch shuffling currently does plain
CPU std::memcpy of every input tensor into the scratch buffer and every
output tensor out of it on every inference. On Cortex-M55-based firmware
targets that have a DMA engine, this is a significant CPU load: inference
time is spent in memcpy that could instead be DMA-offloaded so that the
M55 sleeps while the transfer runs.

This change introduces a thin extern-C indirection — arm_ethos_io_memcpy
— that the EthosU backend uses everywhere it currently calls memcpy for
input/output scratch shuffling. The default (weak) implementation lives
in a separate translation unit (EthosUBackend_IoMemcpy.cpp) and just
calls std::memcpy, so behavior is unchanged for any consumer that doesn't
override it.
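
A minimal sketch of the default path described above, assuming the hook signature shown in the review snippet on this PR (`void* dst, const void* src, size_t size`). The `stage_input` helper is hypothetical and only stands in for a backend call site. In the actual change the weak definition lives in its own translation unit; it is placed in the same file here purely to keep the sketch short.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Declaration as a call-site translation unit would see it.
extern "C" void arm_ethos_io_memcpy(void* dst, const void* src, std::size_t size);

// Weak default: a plain CPU copy. A strong definition of the same symbol
// elsewhere in the link replaces this one.
extern "C" __attribute__((weak)) void
arm_ethos_io_memcpy(void* dst, const void* src, std::size_t size) {
  std::memcpy(dst, src, size);
}

// Hypothetical call site: stage one input tensor into the scratch buffer
// through the hook instead of calling std::memcpy directly.
void stage_input(std::uint8_t* scratch, std::size_t capacity,
                 const std::uint8_t* tensor, std::size_t size) {
  assert(size <= capacity);
  arm_ethos_io_memcpy(scratch, tensor, size);
}
```

With no override linked in, `stage_input` behaves exactly like a direct `std::memcpy` into the scratch buffer.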

Firmware targets can supply a strong-symbol override (e.g. routing
through a DMA engine) without touching the upstream backend code.
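
As an illustration of what such an override might look like, here is a sketch that defines the symbol strongly in a firmware-side file. Everything named `fake_dma_*` is a hypothetical stand-in for a vendor DMA driver (it is simulated with a CPU copy so the sketch compiles anywhere) and is not part of ExecuTorch or of this PR. A real override would likely also need cache maintenance around the transfer, which is omitted here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical DMA driver stand-ins, simulated on the CPU for this sketch.
static int g_fake_dma_transfers = 0;

static bool fake_dma_start_copy(void* dst, const void* src, std::size_t size) {
  std::memcpy(dst, src, size);  // real hardware would do this asynchronously
  ++g_fake_dma_transfers;
  return true;                  // false would mean "channel busy/unavailable"
}

static void fake_dma_wait_done() {
  // Real firmware would block here (e.g. sleep until the completion IRQ).
}

// Strong definition: no weak attribute, so the linker prefers it over the
// weak default shipped with the backend.
extern "C" void arm_ethos_io_memcpy(void* dst, const void* src, std::size_t size) {
  if (size == 0) {
    return;
  }
  assert(dst != nullptr && src != nullptr);
  if (!fake_dma_start_copy(dst, src, size)) {
    std::memcpy(dst, src, size);  // fall back to a CPU copy
    return;
  }
  fake_dma_wait_done();
}
```

Because only this one symbol is replaced, other `memcpy` users in the firmware image keep their normal behavior, and the override file can be compiled only into the builds that want it.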

Implementation notes:

  • The weak default lives in its own TU so the compiler in the call-site
    TUs cannot inline its body and bypass the link-time override. This is
    the same pattern bolt_arm_memcpy_external uses.
  • Three call sites updated: input scratch copy in EthosUBackend.cpp, the
    layout-adjustment chunk loop in EthosUBackend.cpp, and the output
    scratch copy in EthosUBackend_Cortex_M.cpp.

bypass-github-export-checks
bypass-github-pytorch-ci-checks
bypass-github-executorch-ci-checks

Reviewed By: rascani

Differential Revision: D103455766

@3l1 3l1 requested a review from digantdesai as a code owner May 1, 2026 21:06

pytorch-bot Bot commented May 1, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19264

Note: Links to docs will display an error until the docs builds have been completed.

❌ 6 New Failures, 20 Cancelled Jobs, 14 Pending, 7 Unrelated Failures

As of commit b6d333d with merge base cdcc915 (image):

NEW FAILURES - The following jobs have failed:

CANCELLED JOBS - The following jobs were cancelled. Please retry:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the CLA Signed label May 1, 2026

meta-codesync Bot commented May 1, 2026

@3l1 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D103455766.

github-actions Bot commented May 1, 2026

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@meta-codesync meta-codesync Bot force-pushed the export-D103455766 branch from ddea8da to ffc9927 Compare May 1, 2026 21:07
@3l1 3l1 added the partner: arm label May 1, 2026
@3l1 3l1 requested a review from gggekov May 1, 2026 21:10

@zingo zingo left a comment

Nice idea, like it!

Comment thread backends/arm/runtime/EthosUBackend_IoMemcpy.cpp
@meta-codesync meta-codesync Bot changed the title route EthosU input/output memcpy through overridable hook route EthosU input/output memcpy through overridable hook (#19264) May 5, 2026
meta-codesync Bot pushed a commit that referenced this pull request May 5, 2026
@meta-codesync meta-codesync Bot force-pushed the export-D103455766 branch from ffc9927 to 8eeb57c Compare May 5, 2026 22:00
meta-codesync Bot pushed a commit that referenced this pull request May 5, 2026
@meta-codesync meta-codesync Bot force-pushed the export-D103455766 branch from 8eeb57c to 3fe2220 Compare May 5, 2026 23:59
meta-codesync Bot pushed a commit that referenced this pull request May 6, 2026
@meta-codesync meta-codesync Bot force-pushed the export-D103455766 branch from 3fe2220 to 845995e Compare May 6, 2026 00:13

3l1 commented May 6, 2026

⚠️ NOTE: many failing tests - looking... (suspect missing inclusion in some build script)

// unit so the compiler in the call-site TUs cannot inline this body and
// bypass the link-time override (same trick as bolt_arm_memcpy_external).
extern "C" __attribute__((weak)) void
arm_ethos_io_memcpy(void* dst, const void* src, size_t size) {

Regular memcpy should already be weak on embedded toolchains, or we may be able to override it through compiler flags, but this is also OK.

@3l1 3l1 May 6, 2026

Note that we do not want a wide override across subsystems or modules. E.g. we enable this DMA override on specific Zephyr overlay configs for specific app_versions only, i.e. we want only this specific workload (copying tensors back and forth for the NPU) to be offloaded to hardware DMA, since DMA also has its tradeoffs.

meta-codesync Bot pushed a commit that referenced this pull request May 6, 2026
@meta-codesync meta-codesync Bot force-pushed the export-D103455766 branch from 845995e to efec58a Compare May 6, 2026 06:09

zingo commented May 6, 2026

Cortex-M testing:
EthosUBackend.cpp:(.text._ZNK10executorch8backends3arm13EthosUBackend7executeERNS_7runtime23BackendExecutionContextEPvNS3_4SpanIPNS3_6EValueEEE[_ZNK10executorch8backends3arm13EthosUBackend7executeERNS_7runtime23BackendExecutionContextEPvNS3_4SpanIPNS3_6EValueEEE]+0xd2): undefined reference to `arm_ethos_io_memcpy'

This was an interesting side effect. It seems we are building the backend here when we probably should/could avoid it.

meta-codesync Bot pushed a commit that referenced this pull request May 6, 2026
@meta-codesync meta-codesync Bot force-pushed the export-D103455766 branch from efec58a to c4b0f13 Compare May 6, 2026 18:13
@meta-codesync meta-codesync Bot requested a review from larryliu0820 as a code owner May 6, 2026 18:13
@meta-codesync meta-codesync Bot requested a review from kirklandsign as a code owner May 6, 2026 18:13
3l1 added a commit that referenced this pull request May 6, 2026
@3l1 3l1 force-pushed the export-D103455766 branch from c4b0f13 to eb64ab4 Compare May 6, 2026 18:25
meta-codesync Bot pushed a commit that referenced this pull request May 6, 2026
@meta-codesync meta-codesync Bot force-pushed the export-D103455766 branch 2 times, most recently from 00b91bc to b6d333d Compare May 6, 2026 18:37
@meta-codesync meta-codesync Bot merged commit af90130 into main May 6, 2026
385 of 454 checks passed
@meta-codesync meta-codesync Bot deleted the export-D103455766 branch May 6, 2026 23:41
Labels

ciflow/trunk, CLA Signed, fb-exported, meta-exported, module: arm, partner: arm